Abstract



In this Communication we present a statistical analysis of conservation profiles in families
of homologous sequences for nine proteins whose folding nuclei were determined by protein
engineering methods. We show that in all but one protein (AcP) folding nucleus residues
are significantly more conserved than the rest of the protein. Two aspects of our study are
especially important: 1) grouping of amino acids into classes according to their
physical-chemical properties and 2) proper normalization of amino acid probabilities, which
reflects the fact that evolutionary pressure to conserve some amino acid types may itself affect
the concentration of various amino acid types in protein families. Neglecting either of these
two factors may make physical and biological "signals" from conservation profiles disappear.



Introduction 



It is now widely accepted that folding of small single-domain proteins follows a
"nucleation-condensation" mechanism ([Abkevich et al., 1994]; [Itzhaki et al., 1995];
[Fersht, 1997]; [Shakhnovich, 199?]; [Guo & Thirumalai, 1995]; [Pande et al., 1998]), whereby
a relatively small fragment of protein structure is formed in the transition state between the
unfolded and folded states. Residues belonging to this fragment constitute the specific folding
nucleus (SFN). Considerable experimental ([Itzhaki et al., 1995]; [Main et al., 1999];
[Martinez et al., 1998]; [Chiti et al., 1999]) and theoretical ([Abkevich et al., 1994];
[Klimov & Thirumalai, ...]; [... et al., 2000]; [Dokholyan et al., ...]) effort has been devoted
to identification of folding nuclei in real proteins and various models, as well as to the
factors that determine the location of the nucleus in structure and in sequence.

One of the most intriguing aspects of the nucleation-condensation mechanism of protein
folding is its relation to protein evolution. Indeed, residues constituting the folding nucleus
can be metaphorically considered "accelerator pedals" of folding ([Mirny et al., 1998a]), since
mutations in those positions affect the folding rate to a much greater extent than mutations
elsewhere in a protein. One can conclude that if there is evolutionary control of folding rate,
it should have resulted in additional pressure on folding nucleus residues, and such pressure
can be manifested in noticeable additional conservation of nucleus residues.

This idea was first proposed in ([Shakhnovich et al., 1996]), where it was applied to prediction
of nucleus residues from protein structure. Many sequences were designed to fit the structure
of Chymotrypsin Inhibitor 2 (CI2) with low energy. Positions conserved among the designed
sequences were identified as a putative nucleus. In this way blind predictions of the folding
nucleus in CI2 were made, which were verified in independent experiments ([Itzhaki et al., 1995]).

In related studies Ptitsyn examined conservation among distant, yet related by sequence
homology, members of the Cytochrome C ([Ptitsyn, 1998]) and myoglobin ([Ptitsyn & Ting, 1999])
families. In both cases he found conserved clusters of residues without an obvious functional
role, which he suggested belong to the folding nucleus of those proteins. Michnick and
Shakhnovich ([Michnick & Shakhnovich, 1998]) carried out an analysis of conservation in natural
and designed sequences for families of three structurally related proteins (ubiquitin, raf and
ferredoxin) and predicted a possible folding nucleus for those proteins.






Nevertheless, the notion of folding nucleus conservation has drawn some controversy in
the literature. While earlier papers ([Shakhnovich et al., 1996]; [Michnick & Shakhnovich, 1998];
[Ptitsyn, 1998]; [Ptitsyn & Ting, 1999]) suggested conservation of the folding nucleus in some
proteins, a more recent paper by Plaxco and coauthors ([Plaxco et al., 2000]) argued the
opposite. These authors looked at conservation profiles in several protein families for which
protein engineering analysis of folding transition states has been carried out, and did not
observe a correlation between conservation and experimentally measured φ-values. This made them
conclude that there is no evolutionary pressure to control the folding rates.

In this work we study evolutionary conservation of the folding nucleus for several homologous
proteins. Conservation of the folding nucleus is systematically compared with the conservation
in the rest of the protein sequence. In contrast to previous studies, we perform a rigorous
statistical test to assess the significance of higher conservation in the folding nucleus. The
main result of this study is that for all studied proteins, except AcP, the folding nucleus is
significantly more conserved than the rest of the protein. We explain the difference between
our statistical analysis and that of Plaxco et al. ([Plaxco et al., 2000]) by pointing out some
technical shortcomings in the earlier work.



Results and Discussion 

To study evolutionary conservation of the folding nucleus we turn to nine proteins for which
the nucleus has been experimentally identified by protein engineering analysis: CI2, FKBP12,
ACBP, CheY, Tenascin, CD2.d1, U1A, AcP and ADA2h. For each of them we obtain a multiple
sequence alignment from the HSSP database ([Dodge et al., 1998]) (or from the PFAM database
([Bateman et al., 2000]) if HSSP contains too few sequences). We compute variability at
position l of the alignment as

$$s(l) = -\sum_i p_i(l) \log p_i(l) \qquad (1)$$

where p_i(l) is the frequency of residues from class i at position l. We use six classes of
residues to reflect physical-chemical properties of amino acids and their natural pattern of
substitutions: aliphatic [A V L I M C], aromatic [F W Y H], polar [S T N Q], basic [K R],
acidic [D E], and special (reflecting their special conformational properties) [G P]. As a
result of this classification, mutations within a class are ignored (e.g. V → L), while
mutations that change the class are taken into account. Figure 1 presents variability profiles
for the studied proteins, with nucleation positions marked by filled circles. Importantly, we
defined the folding nucleus as it was identified by the original experimental groups (Table 1).
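The class-based variability of equation (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the alignment is taken to be a list of equal-length strings, natural logarithms are used, and gaps or unknown symbols are simply skipped (the text does not say how gaps were treated).

```python
import math
from collections import Counter

# Six residue classes exactly as listed in the text.
MS_CLASSES = {
    "aliphatic": "AVLIMC",
    "aromatic": "FWYH",
    "polar": "STNQ",
    "basic": "KR",
    "acidic": "DE",
    "special": "GP",
}
RESIDUE_TO_CLASS = {aa: name for name, members in MS_CLASSES.items() for aa in members}


def column_variability(column, residue_to_class=RESIDUE_TO_CLASS):
    """Return s(l) = -sum_i p_i(l) log p_i(l) for one alignment column.

    Assumption: symbols that are not in the mapping (gaps, 'X') are skipped.
    """
    classes = [residue_to_class[aa] for aa in column.upper() if aa in residue_to_class]
    if not classes:
        return float("nan")  # column contains no classifiable residues
    total = len(classes)
    counts = Counter(classes)
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def variability_profile(alignment, residue_to_class=RESIDUE_TO_CLASS):
    """Return the list of s(l) values, one per column of the alignment."""
    return [column_variability("".join(col), residue_to_class) for col in zip(*alignment)]
```

For example, a column containing only V, L and I has zero variability under this scheme, because all three residues fall into the aliphatic class.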

Figure 2 clearly shows that nucleus residues are almost always among the most conserved
ones in all studied proteins. It also shows that nucleus residues are not the only conserved
ones: many other residues (predominantly in the cores of the proteins) are also conserved.

In order to evaluate the statistical significance of nucleus conservation we compare
evolutionary conservation of the folding nucleus with the conservation of all residues in the
protein using the following statistical test. We start from the null hypothesis H_0 that nucleus
residues are no more conserved than the whole protein sequence. To test this hypothesis we
compute the median variability of the nucleus residues (med[s_nuc]) and compare it with the
distribution of median variabilities of the same number of residues randomly chosen in the same
protein (f(med[s_rand])). The distribution f(med[s_rand]) is obtained by choosing 10^5 random
sets of n residues (n is the number of residues in the nucleus). The fraction of instances with
med[s_rand] < med[s_nuc] then gives the probability P_0 of accepting H_0. In other words, P_0 is
the probability that the observed lower variability of the folding nucleus is obtained by chance.
Hence, P_0 < α indicates statistically significant evolutionary conservation of the folding
nucleus. Below we use the significance level α = 2%.
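The resampling test described above can be sketched as follows. This is a minimal illustration, not the original implementation: the function name, the fixed random seed, and the choice to sample positions without replacement from the whole protein (nucleus included) are assumptions.

```python
import random
import statistics


def nucleus_p_value(variability, nucleus_positions, n_samples=100_000, seed=0):
    """Estimate P_0: the fraction of random residue sets whose median
    variability is strictly lower than that of the nucleus.

    variability       -- list of s(l) values, one per alignment position
    nucleus_positions -- 0-based indices of nucleus residues (hypothetical input format)
    """
    rng = random.Random(seed)
    nucleus = sorted(set(nucleus_positions))
    med_nuc = statistics.median(variability[i] for i in nucleus)
    all_positions = range(len(variability))
    hits = 0
    for _ in range(n_samples):
        # Draw n distinct positions from the same protein.
        sample = rng.sample(all_positions, len(nucleus))
        if statistics.median(variability[i] for i in sample) < med_nuc:
            hits += 1
    return hits / n_samples
```

A returned value below the chosen α (2% in the text) would indicate that the nucleus is significantly less variable than a random set of the same size.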

Table 2 presents the computed P_0 values. The main result of this work is that in all proteins,
except AcP, residues in the folding nucleus are significantly more conserved than the rest of the
protein.

Next we study how the obtained results depend on the way amino acids are grouped into
classes (see Table 2). When the classification scheme from ([Branden & Tooze, 1998]) (BT) is
used, all proteins except AcP still exhibit significant conservation of the folding nucleus.
This clearly demonstrates that the observed conservation of the folding nucleus is not a
consequence of a particular choice of classification scheme.

However, when amino acids are not grouped into classes, the nucleus exhibits significant
conservation in only four out of nine proteins. Taken together, these results indicate that
substitutions in the folding nucleus may occur, but they are limited to residues that belong to
the same class, i.e. have similar physical-chemical properties ([Thompson & Goldstein, 1996]).

To study which physical-chemical properties are conserved in the folding nucleus we used
various classification schemes. Starting from all 20 amino acids, we grouped some of them
into classes and repeated the analysis, including the statistical tests (see Table 2). The goal
is to find a minimal classification (i.e. one grouping the minimal number of amino acids
together) that provides statistically significant conservation of the folding nucleus. Our
results show that a classification where only I, L, and V are grouped in one class, while all
other amino acids each represent their own class, satisfies this requirement (see Table 2).
This classification provides significant conservation of the nucleus for all proteins except
AcP with α = 5%, and for all proteins except AcP and FKBP12 with α = 2%. This result
demonstrates that I ↔ L ↔ V are the most common substitutions in the nucleus (and in the
protein core in general ([Henikoff & Henikoff, 1992]; [Benner et al., 1994])). These
substitutions are tolerated in the nucleus as they change neither the stability of the native
fold nor the folding rate by much. Analysis of available experimental data (L. Li, unpublished)
shows that changes in stability upon I ↔ L ↔ V mutations are on average
⟨ΔΔG_{N-D}⟩ = 1.0 ± 0.4 kcal mol^-1 for the native state and
⟨ΔΔG_{‡-D}⟩ = 0.2 ± 0.3 kcal mol^-1 for the transition state.
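The classification schemes compared here differ only in which amino acids are merged into one class, so they can be written as interchangeable residue-to-class mappings. The sketch below is illustrative: `make_scheme` and `n_classes` are hypothetical helper names, and only schemes whose membership is spelled out in the text or in the caption of Table 2 are encoded.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def make_scheme(groups):
    """Map each amino acid to a class label.

    Residues named in `groups` share a label; every other residue
    forms a class of its own.
    """
    mapping = {aa: aa for aa in AMINO_ACIDS}
    for group in groups:
        label = "".join(sorted(group))
        for aa in group:
            mapping[aa] = label
    return mapping


def n_classes(mapping):
    """Number of distinct classes in a scheme."""
    return len(set(mapping.values()))


NO_GROUPING = make_scheme([])        # all 20 amino acids kept distinct
ILV_ONLY = make_scheme(["ILV"])      # minimal scheme discussed in the text
MS = make_scheme(["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP"])
BT = make_scheme(["AVFPMIL", "STYHCNQW", "RK", "DE", "G"])
```

Any of these mappings can then be used wherever a residue is converted to its class before frequencies are counted, which makes it straightforward to repeat the same statistical test under each scheme.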

Note that grouping of residues into classes to assess conservation is similar to the use of
substitution matrices in sequence alignment techniques. The underlying idea of both methods
is to take into account the natural physical-chemical similarity between amino acids and their
substitution patterns. Plaxco et al. used all 20 types of amino acids and failed to identify
strong conservation of the folding nucleus ([Plaxco et al., 2000]). Similarly, a method that
relies on simple sequence identity cannot detect distant homology. However, distant homology
between sequences can be detected using proper substitution matrices
([Abagyan & Batalov, 199?]; [Brenner et al., 1998]). The use of substitution matrices is
physically meaningful since they weight, e.g., an I-V match higher than an I-D match, while a
method that relies on percentage of sequence identity weights I-V and I-D equally. Likewise,
our amino acid classification scheme does not count I → V as a mutation, while it certainly
considers substitutions like I → D as mutations to be counted.

Although, on average, the nucleus is more conserved than the rest of the protein, not all
nucleating residues are strongly conserved. For example, in CheY two out of ten nucleation
residues are not conserved. In ADA2h two out of five and in tenascin one out of four residues
are not conserved. Some nucleus residues may be less conserved because they belong to the
"extended nucleus" ([Mirny & Shakhnovich, 1999]) or because of a limitation of our residue
classification scheme, which puts aromatic and aliphatic residues into two different groups,
while aromatic-aliphatic substitutions may occur in the core of some proteins (e.g. tenascin,
ADA2h), usually as a result of correlated mutations that are not treated properly in this
approach (but are taken into account in the conservation-of-conservation approach
([Mirny & Shakhnovich, 1999])). Another interesting observation is that the only protein that
exhibits no preferential conservation of the folding nucleus is AcP, which is the slowest
folding protein among all studied two-state folding proteins (k_f^{H2O} = 0.23 s^-1). Perhaps
this protein did not undergo evolutionary selection for faster folding, and hence its folding
nucleus is under no additional pressure to be conserved.

Note that, as expected, several other residues in the studied proteins are as conserved as the
nucleating ones (see Fig. 2). Those are the residues of the active site, core hydrophobic
residues responsible for stabilization of the native structure, and others. This suggests that
although the folding nucleus is conserved, it cannot be uniquely identified just by analysis of
a single protein family, as the pattern of conservation is dominated by residues conserved for
protein stability and function (see [Mirny & ...]). Thus a consistent analysis should
discriminate between residues that are conserved for functional reasons, for stability reasons
and for kinetic reasons (folding nucleus), as was done in the more detailed
conservation-of-conservation analysis in ([Mirny & Shakhnovich, 1999]).

Why do the results of our analysis differ from those of Plaxco et al. ([Plaxco et al., 2000])?
First, we took into account physical-chemical properties of amino acids and their natural
substitution patterns to group amino acids into classes. As we showed, substitutions among
large aliphatic residues (I, L, V) are frequent in folding nuclei, and this confused the
previous analysis, which did not apply any amino acid classification scheme. While Plaxco et al.
claimed in their paper ([Plaxco et al., 2000]) (without providing supporting evidence) that
grouping of amino acids into classes did not change their conclusions, our analysis shows that
proper classification of amino acids is crucial for detecting conservation in the folding
nucleus.

Second, Plaxco et al. used a different method to compute sequence variability:

$$s_2(l) = -\sum_i p_i(l) \log\left[ p_i(l) / p_i^0 \right] \qquad (2)$$

This equation differs from eq. (1), used in this study, in the normalization by p_i^0, the
"background" frequency of residue type i in all proteins. Although the difference may seem
technical, equations (1) and (2) are based on two different models of evolution. We argue that
while equation (2) may be adequate for DNA sequence analysis ([Stormo, 1998]), it is not
appropriate for analysis of protein evolution.

Equation (2) implicitly assumes that the amino acid composition p_i^0 is fixed a priori in each
protein. Hence equation (2) tends to underestimate conservation of "frequent" amino acids
(L, A, S etc.), while overestimating conservation of less frequent amino acids (W, C, H etc.).
In contrast, equation (1) assumes that the conservation requirement itself affects the
composition, i.e. higher conservation of an amino acid leads to its higher frequency in proteins.

To illustrate this point consider a toy protein that consists of two types of residues:
hydrophobic H and polar P. Assume that 70% of amino acids in this protein are in the core
and 30% are in the loops. Also assume that in the toy world selection for stability requires
100% conservation of H amino acids in the core, while loops are under no evolutionary pressure
and H and P are equally probable in the loops. Then p^0_H = 1 · 0.70 + 0.5 · 0.3 = 0.85 and
p^0_P = 0.5 · 0.3 = 0.15. At conserved core positions s_2(core) = -1 log(1/0.85) ≈ -0.16, while
in the loops s_2(loops) = -0.5 log(0.5/0.85) - 0.5 log(0.5/0.15) ≈ -0.34. Hence, the use of
equation (2) leads to a counterintuitive and apparently wrong result s_2(core) > s_2(loops),
i.e. that loops are more conserved than the 100% conserved core! Clearly this result shows the
inadequacy of equation (2) as applied to protein evolution with unconstrained composition.
Similarly, application of equation (2) to real proteins leads to unreasonably low conservation
of the hydrophobic core as compared to exposed loops (data not shown).
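The toy calculation can be checked numerically. The short script below is a sketch with hypothetical function names `s1` and `s2` for equations (1) and (2); it assumes natural logarithms, which is the base that reproduces the values quoted in the text.

```python
import math

core_fraction, loop_fraction = 0.70, 0.30

# Background composition implied by the toy model.
p0 = {
    "H": 1.0 * core_fraction + 0.5 * loop_fraction,  # 0.85
    "P": 0.5 * loop_fraction,                        # 0.15
}


def s1(freqs):
    """Equation (1): plain entropy of the position-specific frequencies."""
    return -sum(p * math.log(p) for p in freqs.values() if p > 0)


def s2(freqs, background):
    """Equation (2): entropy normalized by background frequencies."""
    return -sum(p * math.log(p / background[aa]) for aa, p in freqs.items() if p > 0)


core = {"H": 1.0, "P": 0.0}    # fully conserved core position
loops = {"H": 0.5, "P": 0.5}   # unconstrained loop position

print(f"s1(core) = {s1(core):.2f}, s1(loops) = {s1(loops):.2f}")
print(f"s2(core) = {s2(core, p0):.2f}, s2(loops) = {s2(loops, p0):.2f}")
```

Under equation (1) the core comes out less variable than the loops, as intuition demands, whereas equation (2) reverses the order.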

A possible way to compensate for variations in amino acid composition of proteins is to
define the sequence entropy as in ([Schneider, 1999]):

$$s(l) = -\sum_i p_i(l) \log p_i(l) + \sum_i p_i^0 \log p_i^0 \qquad (3)$$

where the second term gives the "background" variability due to amino acid composition. This
term, however, does not depend on l and hence does not change the relative variability.
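The last statement can be made explicit by subtracting the entropies of two positions l and l'. In the short derivation below, s_3 denotes the entropy of equation (3) and s that of equation (1); the subscript is introduced here only for this illustration.

```latex
\begin{align*}
s_3(l) - s_3(l')
  &= \Big[ -\sum_i p_i(l) \log p_i(l) + \sum_i p_i^0 \log p_i^0 \Big]
   - \Big[ -\sum_i p_i(l') \log p_i(l') + \sum_i p_i^0 \log p_i^0 \Big] \\
  &= -\sum_i p_i(l) \log p_i(l) + \sum_i p_i(l') \log p_i(l') \\
  &= s(l) - s(l')
\end{align*}
```

The background term cancels, so ranking positions by equation (3) gives the same order as ranking them by equation (1).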

Interestingly, the use of equation (2) by Plaxco et al. ([Plaxco et al., 2000]) gave rise to a
surprising result that active sites in proteins are generally no more conserved than the rest of
the protein (see Fig. 2 of [Plaxco et al., 2000]). Conservation of known active sites was used
in ([Plaxco et al., 2000]) as a control for their method of analysis based on equation (2), a
control which the method apparently failed.

Finally, Plaxco et al. did not study conservation of the folding nucleus. Instead, they focused
on the residues that featured high φ-values in protein engineering experiments and compared
them with low φ-value residues. As we explained above, residues in the folding nucleus do not
necessarily exhibit high φ-values, and many low φ-value residues are conserved in evolution as
they contribute to stabilization of the native structure. Comparison with low φ-value residues
instead of comparison with the whole protein also confused the previous analysis, since most
φ-values have been measured for amino acids located in the core of a protein, and hence these
amino acids are on average more conserved. Here, in contrast, we used the folding nucleus as it
was identified for each protein by the original experimental group and compared its conservation
with the conservation of all amino acids in the protein.
In summary, we showed that the folding nucleus is indeed conserved in most of the proteins
whose folding transition states are known from protein engineering analysis. That does not
mean that folding nucleus residues are the only conserved ones in any family of homologous
proteins. Nor does it necessarily mean that the folding nucleus is more conserved than other
residues in the protein core, as the nucleus is equally important for protein stability and for
fast folding. Our results show that the folding nucleus is more conserved than the rest of the
protein. As stated earlier, it is difficult to uniquely identify the folding nucleus by looking
at a conservation profile in just one family of homologous sequences. Nevertheless, the
conservation of the folding nucleus found in this paper and in other works
([Mirny & Shakhnovich, 1999]; [Li et al., 2000]) points to an exciting possibility that folding
rates may be of biological significance. This possibility needs to be assessed in future studies.
Figure Captions

Fig. 1 Variability profiles (sequence entropy) for nine different proteins computed using MS
residue classes. Circles indicate positions at which φ-values have been experimentally measured.
Residues forming the folding nucleus are shown by filled circles.

Fig. 2 Nine studied proteins with Cβ atoms colored according to the degree of their
conservation (evaluated in Fig. 1): from blue (high conservation) to light-blue, green, yellow
and red (no conservation). Folding nucleus residues are shown by spheres twice as large. Notice
the conserved (blue) cores of the proteins and the non-conserved (yellow and red) surfaces. Also
notice several conserved non-nucleus residues in the protein core.
Table 1: Folding nuclei as identified by the authors

Table 2: Probability P_0 of the nucleus being as conserved as the whole protein (see text for
details) computed for all nine proteins and seven different classification schemes. MS as in
([Mirny et al., 1998]; [Mirny & Shakhnovich, 1999]); BT as in ([Branden & Tooze, 1998]):
hydrophobic [A V F P M I L], polar [S T Y H C N Q W], basic [R K], acidic [D E], gly [G];
N_class is the number of groups in each classification.